Abstract

Fusion and inference from multiple and massive disparate data sources - the requirement for 
our most challenging data analysis problems and the goal of our most ambitious statistical pattern 
recognition methodologies - has many and varied aspects which are currently the target of intense 
research and development. One aspect of the overall challenge is manifold matching - identifying 
embeddings of multiple disparate data spaces into the same low-dimensional space where joint 
inference can be pursued. We investigate this manifold matching task from the perspective of 
jointly optimizing the fidelity of the embeddings and their commensurability with one another, 
with a specific statistical inference exploitation task in mind. Our results demonstrate when and 
why our joint optimization methodology is superior to either version of separate optimization. 
The methodology is illustrated with simulations and an application in document matching. 

1 Introduction 
1.1 Motivation 

Let $(\Xi, \mathcal{F}, P)$ be a probability space, i.e., $\Xi$ is a sample space, $\mathcal{F}$ is a sigma-field, and $P$ is a
probability measure. Consider K measurable spaces $\Xi_1, \ldots, \Xi_K$ and measurable maps $\pi_k : \Xi \to \Xi_k$.
Each $\pi_k$ induces a probability measure $P_k$ on $\Xi_k$. We wish to identify a measurable metric space
$\mathcal{X}$ (with distance function d) and measurable maps $\rho_k : \Xi_k \to \mathcal{X}$, inducing probability measures
$\tilde{P}_k$ on $\mathcal{X}$, so that for $[x_1, \ldots, x_K]' \in \Xi_1 \times \cdots \times \Xi_K$ we may evaluate distances
$d(\rho_{k_1}(x_{k_1}), \rho_{k_2}(x_{k_2}))$ in $\mathcal{X}$. See Figure 1.

Given $\xi_1, \xi_2 \overset{iid}{\sim} P$ in $\Xi$, we may reasonably hope that the random variable
$d(\rho_{k_1} \circ \pi_{k_1}(\xi_1), \rho_{k_2} \circ \pi_{k_2}(\xi_1))$ is stochastically smaller than the random variable
$d(\rho_{k_1} \circ \pi_{k_1}(\xi_1), \rho_{k_2} \circ \pi_{k_2}(\xi_2))$. That is, matched measurements $\pi_{k_1}(\xi_1), \pi_{k_2}(\xi_1)$
representing a single point $\xi_1$ in $\Xi$ are mapped closer to each other than are unmatched
measurements $\pi_{k_1}(\xi_1), \pi_{k_2}(\xi_2)$ representing two different points in $\Xi$. This property allows
inference to proceed in the common representation space $\mathcal{X}$.

However, we do not observe $\xi \in \Xi$; we also do not observe the $x_k = \pi_k(\xi) \in \Xi_k$ directly, nor
do we have knowledge of the maps $\pi_k$. But suppose we have access to functions $\delta_k : \Xi_k \times \Xi_k \to
\mathbb{R}_+ = [0, \infty)$ such that $\delta_k(\pi_k(\xi_1), \pi_k(\xi_2))$ represents the "dissimilarity" of outcomes $\xi_1$ and $\xi_2$
under map $\pi_k$. We propose to use sample dissimilarities for matched data in the disparate
spaces $\Xi_k$ to simultaneously learn maps $\rho_k$ which allow for a powerful test of matchedness in the
common representation space $\mathcal{X}$.




Figure 1: Maps $\pi_k$ induce disparate data spaces $\Xi_k$ from "object space" $\Xi$. Manifold matching involves
using matched data $\{x_{ik}\}$ to simultaneously learn maps $\rho_1, \ldots, \rho_K$ from disparate spaces $\Xi_1, \ldots, \Xi_K$
to a common "representation space" $\mathcal{X}$, for subsequent inference.



1.2 Problem Formulation 

Consider n objects each measured under K different conditions, 

$$x_{i1} \sim \cdots \sim x_{ik} \sim \cdots \sim x_{iK}, \quad i = 1, \ldots, n,$$

where $x_{i1} \sim \cdots \sim x_{ik} \sim \cdots \sim x_{iK}$ denotes K matched measurements $\pi_1(\xi_i), \ldots, \pi_K(\xi_i)$
representing a single object $\xi_i \in \Xi$, where $\Xi$ denotes the "object space". The assumption of
different conditions implies that $x_{ik} \in \Xi_k$ where the spaces $\Xi_1, \ldots, \Xi_K$ cannot be assumed to be
similar. We are given K new measurements $\{y_k\}_{k=1}^K$, $y_k \in \Xi_k$. The question under consideration
is: Does the collection $\{y_k\}_{k=1}^K$ also correspond to matched measurements representing a single
object measured under the K conditions?

We use the $\Xi$ notation to remind the reader that the spaces $\Xi_k$ cannot be assumed to be standard
finite-dimensional Euclidean spaces. We do assume that each space $\Xi_k$ comes with a within-condition
dissimilarity $\delta_k$ - a hollow, symmetric function from $\Xi_k \times \Xi_k$ to $\mathbb{R}_+$ - through which the
matched data $\{x_{ik}\}$ yields $n \times n$ dissimilarity matrices $\Delta_k$, $k = 1, \ldots, K$. For new measurements
$\{y_k\}_{k=1}^K$ we have available for each k the within-condition dissimilarities $\delta_k(y_k, x_{ik})$, $i = 1, \ldots, n$.

Remark 1: The $x_{ik}$ and $y_k$ are introduced mainly for symbolic purposes; the corresponding
data may not be available or may be too complex to use directly, and we proceed from the 
dissimilarities. 

The specific statistical inference exploitation task we consider throughout most of this article 
is hypothesis testing. Our goal, simplified for the case K = 2, is to determine whether $y_1$ and
$y_2$ are a match. That is,

$$H_0 : y_1 \sim y_2 \quad \text{versus} \quad H_A : y_1 \nsim y_2,$$

or equivalently,

$$H_0 : y_1 = \pi_1(\xi),\; y_2 = \pi_2(\xi) \quad \text{versus} \quad H_A : y_1 = \pi_1(\xi),\; y_2 = \pi_2(\xi') \text{ for } \xi \neq \xi' \in \Xi.$$
(We control the probability of missing a true match.) 

1.3 Manifold Matching 

We define manifold matching as simultaneous manifold learning and manifold alignment - iden- 
tifying embeddings of multiple disparate data sources into the same low-dimensional space where 
joint inference can be pursued. Figure 1 depicts our framework. Conditional distributions are
induced by maps $\pi_k$ from "object space" $\Xi$. Our assumption is that the conditional spaces $\Xi_k$ are
not commensurate. For example, if the elements of $\Xi$ are individual people, then a photograph
in image space $\Xi_1$ and a biographical sketch in text document space $\Xi_2$ are not to be directly
compared. Indeed, our fundamental premise defining disparate data sources is that the various
$\Xi_k$ cannot profitably be treated as replicates of the same kind of space. Rather, the various
spaces are different not just in degree but in kind. Each dissimilarity $\delta_k$ has been tailored for
application to $\Xi_k$, and it is inappropriate to apply $\delta_k$ on $\Xi_{k'} \times \Xi_{k'}$ for $k' \neq k$. This distinguishes
our data fusion from conventional multivariate analysis.

In Figure 1, matched points $\{x_{ik}\}$ are used to simultaneously learn appropriate maps $\rho_k$
taking the disparate data from the various $\Xi_k$ into a common representation space $\mathcal{X}$. These
maps are then applied to $\{y_k\}_{k=1}^K$, yielding $\tilde{y}_k = \rho_k(y_k)$, whence (for K = 2) we use $T = d(\tilde{y}_1, \tilde{y}_2)$
as our test statistic and reject for T "large".

Remark 2: Our convention is to use the " ~ " notation for points in the target space X, 
contrasted with no tilde for points in the original spaces. 

Remark 3: We will throughout consider the special case of $\mathcal{X} = \mathbb{R}^m$ for some pre-specified
target dimension m. The fundamentally important and challenging task of choosing the target 
dimension - model selection - will be considered only as a confounding issue in this paper; m 
is a nuisance parameter which must be selected but whose selection is beyond the scope of this 
manuscript. 

1.4 What are these "conditions" and what does "matched" mean? 

As suggested above, one example of "conditions" involves photographs $\{x_{i1}\}$ and biographical
sketches $\{x_{i2}\}$, with "matched" $x_{i1} \sim x_{i2}$ meaning that the photograph $x_{i1}$ and the biographical
sketch $x_{i2}$ are of the same person.

Other illustrative examples include: a general image & caption scenario, with "matched" 
meaning that they go together; multiple languages for text documents, with "matched" mean- 
ing on the same topic; multiple modalities for photographs (e.g., indoor lighting vs outdoor 
lighting, two cameras of different quality, or passport photos and airport surveillance photos), 
with "matched" meaning of the same person; Wikipedia text document and Wikipedia hyperlink
structure, with "matched" meaning of the same document. More generally, our framework may 
be applicable to any scenario in which multiple dissimilarity measures are applied to the objects 
at hand. 

Fundamentally, "matched" means whatever the training data say it means. We know it when 
we see it - or, perhaps more accurately, we know unmatched when we see it; see Figure 2.
Consider, for instance, an example of multiple languages for text documents, with "matched" 
meaning on the same topic. Given English and French Wikipedia documents with the matching 
provided by Wikipedia itself, "matched" means "on the same topic." But of course the Wikipedia 
documents are not direct translations of one another, and documents in different languages on 
the same topic may have significant conceptual differences due to cultural differences, etc. 




Figure 2: An example of "not matched" for multi-lingual text documents. The English is clear
enough to lorry drivers, but the Welsh reads "I am not in the office at the moment. Send any work
to be translated." (See http://news.bbc.co.uk/2/hi/uk_news/wales/7702913.stm, permission
obtained from http://www.golwg360.com/Hafan/default.aspx)



1.5 Dirichlet Setting 

While the matched training data ultimately determine what "matched" means, in order to provide 
a clear mathematical characterization of matchedness we consider an illustrative Dirichlet setting. 
This setting is clearly overly simplified, but it invokes some aspects of the foregoing example of 
multiple languages for text documents. 

Let $S^p = \{x \in \mathbb{R}_+^{p+1} : \sum_{\ell=1}^{p+1} x_\ell = 1\}$ be the standard p-simplex. We consider here the case
$\Xi_1 = S^p$ and $\Xi_2 = S^p$ - the two spaces are, in fact, commensurate in this case, for illustration.
Let $\gamma_i \overset{iid}{\sim}$ Dirichlet($\mathbf{1}$) represent n "objects" or "topics". Let $X_{ik} \overset{ind}{\sim}$ Dirichlet($r\gamma_i + \mathbf{1}$) represent
document i in language k. (Since the $X_{ik}$ take their value in $S^p$, we can think of them as
modelling (normalized) word count histograms with p+1 distinct words. $\Xi_1 = \Xi_2 = S^p$ suggests
a simplified 1-1 word correspondence model. A permutation a indicating that the 1-1 word 
correspondence is unknown may be applied to the dimensions of one space with no alteration 
to our illustration.) In this case, r controls what it means to be matched - e.g., the document
translation quality analogy. If r is large (highly accurate translations), then matched documents
$X_{i1}$ and $X_{i2}$ will be probabilistically more similar than $X_{i1}$ and $X_{i'2}$ for $i \neq i'$; if r is small (rough
translations), then "matched" doesn't mean much. Indeed, the limiting case of $r \to \infty$ (point
masses) yields "matched" means "identical" while r = 0 (recall that Dirichlet($\mathbf{1}$) is uniform
on the simplex) yields "matched" means "no relationship". Figure 3, with p = 2, provides
an illustration wherein matched means quite a lot. A real data version of this setting with
multiple documents per topic is depicted in Figure 4, where three Linguistic Data Consortium
(LDC) Enron email message topic classes are projected into the simplex $S^2$ via Fisher's Linear
Discriminant composed with Latent Semantic Analysis (FLDoLSA) (see, e.g., [UEIE]). 
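
As a rough illustration of how r controls matchedness, the following sketch samples the Dirichlet setting with NumPy. The function name `sample_dirichlet_setting`, the seed, and the use of mean Euclidean distance as a summary are assumptions made for this example; they are not taken from the experiments reported here.

```python
import numpy as np

def sample_dirichlet_setting(n, p, r, K=2, seed=0):
    """Draw n topics gamma_i ~ Dirichlet(1) on the p-simplex and, for each
    topic, K matched documents X_ik ~ Dirichlet(r * gamma_i + 1)."""
    rng = np.random.default_rng(seed)
    gamma = rng.dirichlet(np.ones(p + 1), size=n)            # shape (n, p+1)
    X = np.stack(
        [np.stack([rng.dirichlet(r * g + 1.0) for g in gamma]) for _ in range(K)],
        axis=1,
    )                                                        # shape (n, K, p+1)
    return gamma, X

# Hypothetical settings: n = 200 topics, p = 2, r = 100.
gamma, X = sample_dirichlet_setting(n=200, p=2, r=100.0)

# Matched pairs share a topic; shifting language 2 by one row breaks the matching.
matched = np.linalg.norm(X[:, 0] - X[:, 1], axis=1).mean()
unmatched = np.linalg.norm(X[:, 0] - np.roll(X[:, 1], 1, axis=0), axis=1).mean()
print(f"mean matched distance {matched:.3f}, mean unmatched distance {unmatched:.3f}")
```

Rerunning with a smaller r should bring the two averages closer together, in line with the "rough translation" reading of small r.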




Figure 3: Illustrative Dirichlet setting wherein $X_{ik} \sim$ Dirichlet($r\gamma_i + \mathbf{1}$) represent documents
$i = 1, \ldots, n = 10$ in languages $k = 1, \ldots, K = 2$ in the standard 2-simplex $S^2$. The parameter r
controls the meaning of matchedness - the similarity of matched documents $X_{i1}$ and $X_{i2}$ compared
to unmatched documents $X_{i1}$ and $X_{i'2}$ for $i \neq i'$.



1.6 Related Work 

The 2006 David Hand polemic [1] argued persuasively that a fundamental issue in statistical 
inference research and development - perhaps the fundamental issue - is robustness in the face 
of test data drawn from a distribution not the same as the distribution from which the training 
data are drawn. The disparate information fusion described above - combining multiple spaces 
with different characteristics - provides a setting for investigation of related issues. The recent 
survey [5] considers a wide range of examples and methodologies addressing this phenomenon in
terms of transfer learning, domain adaptation, multitask learning, etc. The recent special issue 
[6] is devoted entirely to dimensionality reduction via subspace and submanifold learning. The 
majority of this article considers the Neyman-Pearson hypothesis testing setting, which provides 



clarity through the most straightforward of inference tasks. In Section 5.2 we briefly consider a 
ranking task. 

Our dissimilarity-centric approach is motivated by the 2005 Pekalska and Duin book [7] on
the dissimilarity representation for pattern recognition and the far-reaching success of
multidimensional scaling methodologies [8] [H2 [TT].
Figure 4: An example considering the FLDoLSA projection into $S^2$ of multiple Enron email
messages identified with three Linguistic Data Consortium (LDC) topics. The three colored scatterplots
- yellow, red, purple - represent documents from the three topics; the green dots represent the
topic means. We see that "matched", meaning "on the same topic", does mean something quite like
Dirichlet($r\gamma_{topic} + \mathbf{1}$) in this case (but the variability "r" may be topic-dependent).



Combining information from disparate data sources when the information in the various 
spaces is fundamentally incommensurate - that is, a separate collection of useful features can 
be extracted from each space but their interpoint geometry precludes profitable alignment in a 
common space - is considered via Cartesian product space embedding in [12].

Preliminary development of our joint optimization methodology presented herein, as well as 
an application to classification tasks, is presented in [13].

1.7 Summary 

In Section 2 we frame the problem as an optimization problem, and lay the groundwork for the 
methodologies proposed in Section 3. Section 4 illustrates the methodologies with instructive 
simulations that illustrate characteristic behavior; in particular, a simulation involving Dirichlet 
random variables sets the stage for the experimental examples on text documents presented in 
Section 5. Finally, Section 6 provides discussion and suggestions for several areas of continuing
research. 



2 Fidelity and Commensurability 

As suggested in Figure 1, our goal is to identify maps $\rho_k$ taking $\Xi_k$ to $\mathbb{R}^m$ (for some pre-specified
m) such that (for K = 2) the power of the test, $P[d(\tilde{y}_1, \tilde{y}_2) > c_\alpha \mid H_A : y_1 \nsim y_2]$, is large, where
the critical value $c_\alpha$ is determined by the null distribution of the test statistic and the allowable
Type I error level $\alpha$.

We proceed using $\ell_2$ error for convenience and simplicity; clearly there is ample reason to
consider other error criteria for particular applications. Similarly, we will assume symmetric
dissimilarities $\delta_k$.

The available matched points $\{x_{ik}\}$ are used to identify appropriate maps $\rho_k$. Fidelity is
how well the mapping $x_{ik} \mapsto \tilde{x}_{ik}$ preserves original dissimilarities. The within-condition squared
fidelity error is given by

$$\epsilon^2_{f_k} = \frac{1}{\binom{n}{2}} \sum_{1 \le i < j \le n} \left( d(\tilde{x}_{ik}, \tilde{x}_{jk}) - \delta_k(x_{ik}, x_{jk}) \right)^2$$

for each k. If the fidelity error is large, then it is likely that the mapping does not capture aspects
of original data that may be needed for inference.

On the other hand, even if all fidelity errors are small, inference may fail if $d(\tilde{y}_1, \tilde{y}_2)$ is large
under the "matched" null hypothesis $H_0 : y_1 \sim y_2$. Commensurability is how well the mappings
preserve matchedness; the between-condition squared commensurability error is given by

$$\epsilon^2_{c_{k_1 k_2}} = \frac{1}{n} \sum_{1 \le i \le n} \left( d(\tilde{x}_{ik_1}, \tilde{x}_{ik_2}) - \delta_{k_1 k_2}(x_{ik_1}, x_{ik_2}) \right)^2.$$

Alas, $\delta_{k_1 k_2}$ does not exist - we have no dissimilarity on $\Xi_{k_1} \times \Xi_{k_2}$. However, the concept of
"matchedness" suggests that it might be reasonable to set $\delta_{k_1 k_2}(x_{ik_1}, x_{ik_2}) = 0$ for all $i, k_1, k_2$, in
which case the commensurability error is the mean squared distance between matched points -
the same criterion optimized by the Procrustes matching employed below.
There is also between-condition squared separability error given by

$$\epsilon^2_{s_{k_1 k_2}} = \frac{1}{\binom{n}{2}} \sum_{1 \le i < j \le n} \left( d(\tilde{x}_{ik_1}, \tilde{x}_{jk_2}) - \delta_{k_1 k_2}(x_{ik_1}, x_{jk_2}) \right)^2.$$

However, it is less clear how to identify a reasonable stand-in for the $\delta_{k_1 k_2}$ terms in this
expression. We will return to this issue when presenting our joint optimization inference methodology
proposal in Section 3.3 below.

If all these errors are small - and if the target dimensionality is low enough so that estimation
variance does not dominate (see e.g. [14] Section 3 and [15] Figure 12.1) - then successful inference
in the target space may be achievable. The idea of the joint optimization method proposed in
this manuscript (Section 3.3) is to attempt to minimize all three of these errors simultaneously.
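
To make the three error criteria concrete, here is a minimal NumPy sketch for K = 2 with Euclidean d. The function names are assumptions for illustration; the commensurability error uses the convention of setting the between-condition dissimilarity of matched points to zero, and the separability error takes an imputed matrix `W` as input because no canonical choice is fixed at this point in the text.

```python
import numpy as np

def pairwise_dist(A, B):
    """Euclidean distances between the rows of A and the rows of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def fidelity_error(Xk, Delta_k):
    """Mean over pairs i < j of (d(x_ik, x_jk) - delta_k(x_ik, x_jk))^2."""
    iu = np.triu_indices(Xk.shape[0], k=1)
    return np.mean((pairwise_dist(Xk, Xk)[iu] - Delta_k[iu]) ** 2)

def commensurability_error(X1, X2):
    """Mean squared distance between matched embedded points
    (between-condition dissimilarity of matched points set to zero)."""
    return np.mean(np.sum((X1 - X2) ** 2, axis=1))

def separability_error(X1, X2, W):
    """Mean over pairs i < j of (d(x_i1, x_j2) - W_ij)^2 for an imputed W."""
    iu = np.triu_indices(X1.shape[0], k=1)
    return np.mean((pairwise_dist(X1, X2)[iu] - W[iu]) ** 2)
```

For example, an embedding whose interpoint distances reproduce $\Delta_k$ exactly has zero fidelity error, and two identical configurations have zero commensurability error.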



3 Inference Methodologies 

In this section we present three methodologies for performing our manifold matching inference - 
one which focuses on fidelity and is based on multidimensional scaling and Procrustes matching, 
one which focuses on commensurability and is based on canonical correlation analysis, and then
our proposal for joint optimization of fidelity and commensurability. 

Before proceeding, we briefly review multidimensional scaling, Procrustes matching, and 
canonical correlation analysis. 

Multidimensional scaling (MDS) takes an $n \times n$ dissimilarity matrix $\Delta = [\delta_{ij}]$ and produces
a configuration of n points $\tilde{x}_1, \ldots, \tilde{x}_n$ in a target metric space endowed with distance function
d such that the collection $\{d(\tilde{x}_i, \tilde{x}_j)\}$ agrees as closely as possible with the original $\{\delta_{ij}\}$ under
some specified error criterion; see for instance [9] [TUl E]. For example, $\ell_2$ (also known as "raw
stress") MDS minimizes $\sum_{1 \le i < j \le n} (d(\tilde{x}_i, \tilde{x}_j) - \delta_{ij})^2$.
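
One standard way to minimize raw stress is SMACOF-style majorization (repeated Guttman transforms). The sketch below is an illustrative NumPy implementation under that assumption; the source does not say which optimizer was used, and the function names, random initialization, and iteration count are hypothetical.

```python
import numpy as np

def raw_stress(X, Delta):
    """Sum over pairs i < j of (d(x_i, x_j) - delta_ij)^2."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    iu = np.triu_indices(len(X), k=1)
    return np.sum((D[iu] - Delta[iu]) ** 2)

def smacof_mds(Delta, m, n_iter=300, seed=0):
    """Raw stress MDS into R^m via unweighted Guttman transforms."""
    n = Delta.shape[0]
    X = np.random.default_rng(seed).standard_normal((n, m))  # random start
    for _ in range(n_iter):
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = np.where(D > 0, Delta / D, 0.0)           # delta_ij / d_ij
        B = -ratio
        B[np.diag_indices(n)] = ratio.sum(axis=1)             # rows of B sum to zero
        X = B @ X / n                                         # Guttman transform
    return X
```

Each Guttman transform cannot increase the raw stress, so the returned configuration is at least as good as the random start; like any MDS solver, it may stop at a local minimum.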

Out-of-sample embedding is used throughout this paper - given a configuration $\{\tilde{x}_i\}_{i=1}^n$ of the
training observations and dissimilarities between test observations and the training observations,
the test points are embedded into the existing configuration so as to be as $\ell_2$-consistent as possible
with these dissimilarities. This out-of-sample embedding can be one at a time, or jointly if the
dissimilarities among multiple test observations are also available. Trosset and Priebe [16] present
the out-of-sample methodology appropriate for classical MDS embeddings. We use raw stress
embeddings herein, and the appropriate corresponding out-of-sample methodology is presented
in [17].
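
For a single test point, out-of-sample raw stress embedding amounts to choosing $\tilde{y}$ to minimize $\sum_i (\|\tilde{y} - \tilde{x}_i\| - \delta_i)^2$ with the training configuration held fixed. The following sketch uses a simple majorization update for that objective; it is an assumption-laden illustration, not the procedure of the cited reference, and the names `oos_embed` and `oos_stress` are hypothetical.

```python
import numpy as np

def oos_stress(y, X, delta):
    """Raw stress of placing one new point y against the fixed configuration X."""
    return np.sum((np.linalg.norm(y - X, axis=1) - delta) ** 2)

def oos_embed(X, delta, n_iter=500):
    """Embed one test point given its dissimilarities delta (length n)
    to the n training points whose embedded coordinates are the rows of X."""
    y = X.mean(axis=0)                                # start at the centroid
    for _ in range(n_iter):
        diff = y - X
        norms = np.linalg.norm(diff, axis=1)
        safe = np.where(norms > 0, norms, 1.0)        # diff is zero wherever norms is zero
        # Average of the points at distance delta_i from x_i in the direction of y.
        y = np.mean(X + (delta / safe)[:, None] * diff, axis=0)
    return y

# Hypothetical check: 12 training points on a circle of radius 2, exact distances.
theta = np.linspace(0.0, 2.0 * np.pi, 12, endpoint=False)
X_train = 2.0 * np.column_stack([np.cos(theta), np.sin(theta)])
y_true = np.array([0.3, -0.2])
delta = np.linalg.norm(y_true - X_train, axis=1)
y_hat = oos_embed(X_train, delta)
```

Each update cannot increase the objective. With noisy dissimilarities the minimizer will in general not reproduce any "true" location, and joint embedding of several test points would need the dissimilarities among them as well.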

Procrustes matching [18] [19] [20] [21] takes two matched collections $\tilde{X}_1$ and $\tilde{X}'_2$ of n points in
$\mathbb{R}^m$ and finds the rigid motion transformation which optimally aligns the two collections. For
example, $\ell_2$ Procrustes minimizes the Frobenius norm $\|\tilde{X}_1 - \tilde{X}'_2 Q\|_F$ over all $m \times m$ matrices
Q such that $Q^T Q = I$. (We assume the dissimilarities have been scaled so that a scaling is not
required in the Procrustes mapping. Thus Q defines a rigid motion mapping $\tilde{X}'_2$ "onto" $\tilde{X}_1$. We
address this issue briefly in Section 6.)
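
The orthogonal Procrustes problem has a closed-form solution through the singular value decomposition. The sketch below assumes both configurations are already on a common scale and, for simplicity, centered; the function name is hypothetical.

```python
import numpy as np

def procrustes_rotation(X1, X2):
    """Return the orthogonal Q minimizing ||X1 - X2 Q||_F.

    With X2^T X1 = U S V^T, the minimizer is Q = U V^T.
    """
    U, _, Vt = np.linalg.svd(X2.T @ X1)
    return U @ Vt

# Hypothetical check: rotate a configuration, then recover the alignment.
rng = np.random.default_rng(0)
X1 = rng.standard_normal((50, 3))
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # a random orthogonal matrix
X2 = X1 @ R.T                                      # so that X2 @ R equals X1
Q = procrustes_rotation(X1, X2)
```

Here `X2 @ Q` reproduces `X1` up to floating-point error, and `Q` may include a reflection since only orthogonality is imposed.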

Canonical correlation analysis (CCA) takes a collection $X_1$ of n points in $\mathbb{R}^{m_1}$ and a matched
collection $X_2$ of n points in $\mathbb{R}^{m_2}$ and finds the pair of linear maps $U_1 : \mathbb{R}^{m_1} \to \mathbb{R}$ and $U_2 : \mathbb{R}^{m_2} \to \mathbb{R}$
which maximizes the correlation between $\tilde{X}_1 = U_1(X_1)$ and $\tilde{X}_2 = U_2(X_2)$. Performing m iterations
of this procedure in the successive orthogonal subspaces yields a CCA procedure which
maps to $\mathbb{R}^m$. See, for instance, [22] [23] [24].
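
Linear CCA can be computed by whitening each block and taking an SVD of the whitened cross-covariance. The sketch below is one common formulation, given for illustration only; the small ridge term `reg` and the function name are assumptions and not part of the methodology described here.

```python
import numpy as np

def cca(X1, X2, m, reg=1e-8):
    """Return (U1, U2, corr): m pairs of canonical directions and correlations.

    X1 is n x m1 and X2 is n x m2, with rows matched across the two collections.
    """
    n = X1.shape[0]
    X1c, X2c = X1 - X1.mean(axis=0), X2 - X2.mean(axis=0)
    S11 = X1c.T @ X1c / (n - 1) + reg * np.eye(X1.shape[1])
    S22 = X2c.T @ X2c / (n - 1) + reg * np.eye(X2.shape[1])
    S12 = X1c.T @ X2c / (n - 1)

    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)                  # S is symmetric positive definite
        return V @ np.diag(w ** -0.5) @ V.T

    A, B = inv_sqrt(S11), inv_sqrt(S22)
    U, s, Vt = np.linalg.svd(A @ S12 @ B)         # singular values = canonical correlations
    return A @ U[:, :m], B @ Vt.T[:, :m], s[:m]

# Hypothetical check: X2 is an invertible linear image of X1, so correlations are near 1.
rng = np.random.default_rng(0)
X1 = rng.standard_normal((100, 3))
X2 = X1 @ np.array([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 3.0, 1.0]])
U1, U2, corr = cca(X1, X2, m=2)
```

The embedded points are the centered data times the returned directions, e.g. `(X1 - X1.mean(axis=0)) @ U1`.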

Let us now consider these tools as building blocks for manifold matching inference. 

3.1 Procrustes $\circ$ MDS

Multidimensional scaling yields low-dimensional embeddings. That is, $\Delta_1 \mapsto \tilde{X}_1$ and $\Delta_2 \mapsto \tilde{X}'_2$
yields $n \times m$ configurations. Procrustes($\tilde{X}_1, \tilde{X}'_2$) yields

$$Q^* = \arg\min_{Q^T Q = I} \|\tilde{X}_1 - \tilde{X}'_2 Q\|_F.$$

Given $\delta_k(y_k, x_{ik})$, $i = 1, \ldots, n$ for k = 1, 2, out-of-sample embedding of the test data gives
$y_1 \mapsto \tilde{y}_1$, $y_2 \mapsto \tilde{y}'_2$ where the embedded points are chosen so that their distances to $\tilde{x}_{ik}$ agree
as closely as possible with the available dissimilarities. Using the rigid motion transformation
obtained in the Procrustes step, both $\tilde{y}_1$ and $\tilde{y}_2 = ((\tilde{y}'_2)^T Q^*)^T$ are in $\mathbb{R}^m$ with the same coordinate
system. Thus inference may proceed by rejecting for large values of $d(\tilde{y}_1, \tilde{y}_2)$. We dub this
separate embedding approach "Procrustes composed with multidimensional scaling", or "pom".

From an inspection of the raw stress multidimensional scaling criterion function, it follows
immediately that the $\Delta_k \mapsto \tilde{X}_k$ mappings minimize fidelity error. Thus we have established the
following result:

Theorem 1: pom optimizes fidelity without regard for commensurability.

That is, the maps $\rho_k$ are identified separately, with no concern for whether the commensurability
optimization in the Procrustes step will be able to provide a good alignment.



3.2 Canonical Correlation 

Since canonical correlation begins with Euclidean data, the first step of this methodology
necessarily involves multidimensional scaling. This appears similar to Procrustes $\circ$ MDS above, but
in this case no attempt is made to achieve meaningful dimensionality reduction. Multidimensional
scaling yields high-dimensional embeddings, $\Delta_1 \mapsto X'_1$ and $\Delta_2 \mapsto X'_2$, but in this case
these maps are to the highest-dimensional space possible, $\mathbb{R}^{n-1}$ in general. Canonical correlation
finds linear maps to $\mathbb{R}^m$, $U_1 : X'_1 \mapsto \tilde{X}_1$ and $U_2 : X'_2 \mapsto \tilde{X}_2$, to maximize correlation. Again,
out-of-sample embedding yields (n - 1)-dimensional points $y_1 \mapsto y'_1$, $y_2 \mapsto y'_2$. Then $\tilde{y}_1 = U_1^T y'_1$
and $\tilde{y}_2 = U_2^T y'_2$ can be directly compared. An investigation of the correlation criterion function
shows that the CCA maps $U_1$ and $U_2$ minimize commensurability error, subject to linearity.
Thus there is no need for Procrustes in this case, and once again inference may proceed: reject
for large values of $d(\tilde{y}_1, \tilde{y}_2)$. We dub this approach "cca".

From the equivalence of the correlation objective function and commensurability error, we 
have established the following result: 



Theorem 2: cca optimizes commensurability without regard for fidelity. 



That is, the maps $\rho_k$ are identified jointly, but with no concern for fidelity of the individual
embeddings (beyond linearity). 



3.3 Omnibus Embedding 

In response to the optimization objectives of the two methodologies presented above - one con- 
sidering fidelity only and the other considering commensurability only - we develop an omnibus 
embedding methodology explicitly focused on the joint optimization of fidelity and commensu- 
rability. 
Figure 5: Depiction of the $2n \times 2n$ omnibus dissimilarity matrix M, including imputed dissimilarities
$W = [\delta_{12}(x_{i1}, x_{j2})]$ and out-of-sample test data $y_1, y_2$.

Under the "matched" assumption, we impute dissimilarities $W = [\delta_{12}(x_{i1}, x_{j2})]$ to obtain a
$2n \times 2n$ omnibus dissimilarity matrix M. See Figure 5, which depicts M as a block matrix
consisting of the $n \times n$ dissimilarity matrices $\Delta_1$ and $\Delta_2$ on the diagonal and W as the $n \times n$
off-diagonal block. (This generalizes immediately to K > 2.) As discussed above, it seems
reasonable under $H_0$ to set the diagonal elements $\delta_{k_1 k_2}(x_{ik_1}, x_{ik_2})$ of W to zero. (Notice, however,
that $\delta_{k_1 k_2}(x_{ik_1}, x_{ik_2}) = 0$ for $k_1 \neq k_2$ is not necessarily "truth." For instance, the Dirichlet setting
of Section 1.5 with $r < \infty$ would have non-zero elements for diag(W). Still, this "shrinkage" of
diag(W) to zero seems reasonable.) As for the off-diagonal elements of W, we argue that either
leaving them as missing data unused in the subsequent optimization or letting $W = (\Delta_1 + \Delta_2)/2$
are reasonable suggestions; we will return to this imputation issue later. Once we have settled
on W, our approach considers MDS embedding of M as 2n points in $\mathbb{R}^m$ - zeros on the diagonal
of W act to force matched points to be embedded near each other. It is clear that raw stress
MDS applied to M has as its objective function precisely $\epsilon^2_{f_1} + \epsilon^2_{f_2} + \epsilon^2_{c_{12}} + \epsilon^2_{s_{12}}$. If diag(W) = 0
and the off-diagonal elements are treated as missing and ignored in the optimization, then this
objective function reduces to a consideration of just fidelity and commensurability.
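
The block structure of the omnibus matrix can be assembled directly. This is a minimal sketch for K = 2; the function name and the use of NaN to flag "missing" off-diagonal entries of W are conventions assumed for this example, and a weighted MDS routine would be needed to actually ignore the flagged entries.

```python
import numpy as np

def omnibus_matrix(Delta1, Delta2, impute="average"):
    """Assemble the 2n x 2n matrix M = [[Delta1, W], [W^T, Delta2]].

    impute="average": W = (Delta1 + Delta2) / 2.
    impute="missing": off-diagonal entries of W are NaN, to be skipped by the optimizer.
    In both cases diag(W) = 0, encoding that matched points should coincide.
    """
    n = Delta1.shape[0]
    if impute == "average":
        W = (Delta1 + Delta2) / 2.0
    elif impute == "missing":
        W = np.full((n, n), np.nan)
    else:
        raise ValueError("impute must be 'average' or 'missing'")
    np.fill_diagonal(W, 0.0)
    return np.block([[Delta1, W], [W.T, Delta2]])

# Toy dissimilarity matrices (made up for illustration).
D1 = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [2.0, 3.0, 0.0]])
D2 = np.array([[0.0, 3.0, 4.0], [3.0, 0.0, 5.0], [4.0, 5.0, 0.0]])
M = omnibus_matrix(D1, D2)
```

Embedding `M` with raw stress MDS then yields 2n points, the first n for condition 1 and the last n for condition 2.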

Let $u_{i1} = \delta_1(y_1, x_{i1})$ and $v_{i2} = \delta_2(y_2, x_{i2})$. Under $H_0$, impute $v_{i1} = \delta_{12}(y_1, x_{i2})$ and $u_{i2} =
\delta_{12}(y_2, x_{i1})$ via $v_1 = u_2 = (u_1 + v_2)/2$. Out-of-sample embedding of $(u_1^T, v_1^T)^T$ and $(u_2^T, v_2^T)^T$
yields $\tilde{y}_1$ and $\tilde{y}_2$. Reject for large values of $d(\tilde{y}_1, \tilde{y}_2)$. We dub this omnibus embedding approach
for joint optimization of fidelity and commensurability "jofc".

Obviously, the choice of W is key for this joint optimization. Also, note that weights can be
incorporated into the MDS optimization criterion; this weighting can become quite elaborate,
but in its simplest form it yields a more general tradeoff between fidelity and commensurability
via $w(\epsilon^2_{f_1} + \epsilon^2_{f_2}) + (1 - w)\epsilon^2_{c}$.

4 Illustrative Simulation

In this section we present an illustrative Dirichlet simulation which helps to elucidate when and 
why our joint optimization methodology is superior to either version of separate optimization. 

4.1 Dirichlet Product Model 

We describe a probability model with parameters p, q, r, a, and K. 

Let $\Xi_k = S^{p+q}$, k = 1, 2. Here the simplex $S^p$ encodes "signal" and the simplex $S^q$ encodes
"noise". That is, on $S^p$ we let $\gamma_i \overset{iid}{\sim}$ Dirichlet($\mathbf{1}$) and mutually independent
$X^1_{ik} \sim$ Dirichlet($r\gamma_i + \mathbf{1}$) (signal, as in Section 1.5) while on $S^q$ we let $X^2_{ik} \overset{iid}{\sim}$ Dirichlet($\mathbf{1}$) (pure
noise). For $a \in [0, 1]$, let $X_{ik} = [(1 - a)X^1_{ik}, aX^2_{ik}]$ - the concatenation of (weighted) signal and
noise dimensions. The resultant distribution for $(X_{i1}, \ldots, X_{iK})$ is denoted by $F_{p,q,r,a,K}$, and
$F_{p,q,r,a,K \mid \gamma_1, \ldots, \gamma_n}$ denotes the distribution conditional on the location of the $\gamma_i$.
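
The product model can be simulated by concatenating a weighted signal draw and a weighted noise draw. The sketch below is an illustrative reading of the description above: the function name and seed are assumptions, and each simplex is stored with one more coordinate than its dimension, so a concatenated vector has p + q + 2 coordinates.

```python
import numpy as np

def sample_product_model(n, p, q, r, a, K=2, seed=0):
    """Draw gamma_i ~ Dirichlet(1) on S^p and X_ik = [(1-a) signal, a noise]."""
    rng = np.random.default_rng(seed)
    gamma = rng.dirichlet(np.ones(p + 1), size=n)
    X = np.empty((n, K, p + q + 2))
    for i in range(n):
        for k in range(K):
            signal = rng.dirichlet(r * gamma[i] + 1.0)   # matched through gamma_i
            noise = rng.dirichlet(np.ones(q + 1))        # independent of gamma_i
            X[i, k] = np.concatenate([(1.0 - a) * signal, a * noise])
    return gamma, X

# Parameter values taken from the simulation described in Section 4.3, with a smaller n.
gamma_pm, X_pm = sample_product_model(n=50, p=3, q=3, r=100.0, a=0.1)
```

Every row of `X_pm[:, k]` sums to one, with mass 1 - a on the signal coordinates and a on the noise coordinates.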

4.2 Testing 

For each of $n_{mc}$ Monte Carlo replicates ($n_{mc} = 1000$ in the simulations), we generate n matched
pairs according to the Dirichlet product model distribution $F_{p,q,r,a,K=2}$ by first generating $\gamma_1, \ldots, \gamma_n$
and then, conditional on the collection $\{\gamma_i\}$, generating the matched pairs $(X_{i1}, X_{i2})$. Embeddings
are defined for each of the three competing methodologies based on this matched training
data. For each test datum under $H_0$, one new $\gamma$ is generated, a matched pair is generated,
out-of-sample embedding is performed, and the statistic $T = d(\tilde{y}_1, \tilde{y}_2)$ is calculated; this is repeated
s times independently (s = 1000 in the simulations) and the critical value $c_\alpha$ for the allowable
Type I error level $\alpha$ is determined based on the Monte Carlo estimate of the null distribution of
T. Then unmatched pairs are generated, out-of-sample embedding is performed, and the statistic
T is calculated for test data under $H_A$; this provides an estimate of the conditional power

$$P[d(\tilde{y}_1, \tilde{y}_2) > c_\alpha \mid H_A, \gamma_1, \ldots, \gamma_n].$$

We perform $n_{mc}$ Monte Carlo replicates to integrate out the $\gamma_1, \ldots, \gamma_n$, yielding comparative
power estimates. We also investigate conditional power for particular collections $\{\gamma_i\}$, in order
to better understand precisely when and why our joint optimization methodology is superior to
either version of separate optimization.
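
Given Monte Carlo samples of T under the null and under the alternative, the critical value and the power estimate reduce to an empirical quantile and an exceedance frequency. The helper names and the toy numbers below are assumptions for illustration, not values from the simulations.

```python
import numpy as np

def critical_value(T_null, alpha):
    """Empirical (1 - alpha) quantile of the null test statistics."""
    return np.quantile(T_null, 1.0 - alpha)

def estimated_power(T_null, T_alt, alpha):
    """Fraction of alternative test statistics exceeding the critical value."""
    return np.mean(T_alt > critical_value(T_null, alpha))

# Synthetic statistics (made up): null values 1..100, alternatives shifted by 50.
T_null = np.arange(1.0, 101.0)
T_alt = T_null + 50.0
c = critical_value(T_null, alpha=0.05)
power = estimated_power(T_null, T_alt, alpha=0.05)
```

Sweeping `alpha` over a grid and plotting it against the resulting power gives a curve of the kind used to compare the three methods.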

4.3 Results 

Figure 6 presents results from our Dirichlet product model, K = 2, with p = 3, q = 3, r =
100, a = 0.1. The target dimension is m = 2. We use n = 100. The allowable Type I error level
$\alpha$ is plotted against power $\beta = P[d(\tilde{y}_1, \tilde{y}_2) > c_\alpha \mid H_A]$. The results are based on $n_{mc} = 1000$
Monte Carlo replicates with s = 1000; the differences in the curves are statistically significant.
In this case, jofc with $W = (\Delta_1 + \Delta_2)/2$ is superior to both pom and cca.




Figure 6: Dirichlet product model simulation results plotting the Type I error level $\alpha$ against
power $\beta = P[d(\tilde{y}_1, \tilde{y}_2) > c_\alpha \mid H_A]$, indicating that jofc is superior to both pom and cca. See text for
description.

4.4 Analysis



The Dirichlet product model is designed specifically to illustrate when and why jofc is superior 
to both pom and cca in terms of fidelity and commensurability. 

If q is large with respect to the target dimensionality m, then with high probability cca will
identify an m-dimensional subspace in the "noise" simplex $S^q$ with spurious correlation. This
phenomenon requires only that a > 0. In this event, the out-of-sample embedding will produce
arbitrary $\tilde{y}_1$ and $\tilde{y}_2$, even under $H_0$. Thus the null distribution of the test statistic will be
inflated by these spurious correlations. If the allowable Type I error level $\alpha$ is smaller than the
probability of inflation, then the power of the cca method will be negatively affected.

If a is small and m < p, then with high probability the m-dimensional subspaces identified by
the MDS step will come from the "signal" simplex $S^p$. If m < p, then with positive probability,
these two subspaces, identified separately in pom, will be geometrically incommensurate (see
Figure 7). Thus the null distribution of the test statistic will be inflated by these incommensurate
cases. If the allowable Type I error level $\alpha$ is smaller than the probability of inflation, then the
power of the pom method will be negatively affected.

For large q and small a, the two phenomena described above occur in the same model. The 
jofc method is not susceptible to either phenomenon: incorporating fidelity into the objective 
function obviates the spurious correlation phenomenon, and incorporating commensurability into 
the objective function obviates the geometric incommensurability phenomenon. Thus we can es- 
tablish that, for a range of Dirichlet product model distributions, jofc is superior to both pom 
and cca. 

Theorem 3: Let $m \in \{1, \ldots, \min\{p - 1, q\}\}$, $a \in (0, 1/2)$, and $r \in (0, \infty)$. Then for
large q, small a, and large r, there exists allowable Type I error level $\alpha > 0$ such that the
Dirichlet product model distribution $F_{p,q,r,a,K=2}$ with target dimensionality m yields power
$\beta_{jofc} > \max\{\beta_{pom}, \beta_{cca}\}$, where power $\beta = P[d(\tilde{y}_1, \tilde{y}_2) > c_\alpha \mid H_A]$ for the various testing
methodologies jofc, pom, and cca.

Proof: Let $b_1$ denote the probability that cca suffers from the spurious correlation
phenomenon, and let $b_2$ denote the probability that pom suffers from the geometric
incommensurability phenomenon. Then $q \gg p$ implies that cca suffers from the spurious correlation
phenomenon with high probability and thus $b_1 \approx 1$ and $\beta_{cca} \approx \alpha$. For $a \approx 0$ and r sufficiently large,
jofc and pom identify approximately the same embeddings except for the cases in which pom
suffers from the incommensurability phenomenon. Thus the null distribution of $T = d(\tilde{y}_1, \tilde{y}_2)$
for jofc is approximately point mass at zero while the null distribution of T for pom has $b_2$ mass
$\gg 0$. Hence $\alpha \approx b_2/2$ yields $\beta_{jofc} \approx 1$ while $\beta_{pom} \approx 1/2$. $\blacksquare$

Delving into our simulation results via investigation of conditional power $P[d(\tilde{y}_1, \tilde{y}_2) >
c_\alpha \mid H_A, \gamma_1, \ldots, \gamma_n]$, it is apparent that the superiority of jofc is indeed due to occurrences of
the phenomena described above - individual Monte Carlo replicates (particular selections of the
$\{\gamma_i\}$, essentially) are identified in which the spurious correlation phenomenon causes poor
performance for cca or the incommensurability phenomenon causes poor performance for pom but
in which jofc is unaffected.

We note that the Dirichlet product model introduced here as an aid in understanding when 
and why jofc is superior to both pom and cca does in fact (loosely) model general high-dimensional 
real data scenarios: many dimensions consisting mostly of noise along with a few signal dimen- 
sions. 
Figure 7: Idealization of the incommensurability phenomenon: for a symmetric collection
$\{\gamma_1, \gamma_2, \gamma_3, \gamma_4\}$ in the simplex $S^3$, all four of the facet projections have the same fidelity and are
geometrically incommensurable with one another.

4.5 Gaussian Model



A Gaussian model, analogous to the Dirichlet product model investigated above, is constructed 
here to provide a sense of the generality of models with many dimensions consisting mostly of 
noise along with a few signal dimensions. 

We consider $p$-dimensional means $\mu_i \stackrel{iid}{\sim} N(0, I_p)$, $i = 1, \ldots, n$, analogous to the $\gamma_i$ from the
Dirichlet model. Matchedness arises from independent $X^1_{ik} \sim N(\mu_i, r^{-1} I_p)$, $i = 1, \ldots, n$,
$k = 1, \ldots, K$, for $r \in (0, \infty)$; as $r$ increases, the degree of matchedness increases. As before, we have
independent normally distributed $q$-dimensional "noise" vectors $X^2_{ik}$. Again, for $a \in [0,1]$,
$X_{ik} = [(1 - a)X^1_{ik}, aX^2_{ik}]$ represents the concatenation of (weighted) signal and noise dimensions.
As with the Dirichlet product model, both the spurious correlation phenomenon and the geometric
incommensurability phenomenon are present in this Gaussian model.
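
The generative description above can be sketched in a few lines of Python. This is an illustrative sampler only, not the simulation code behind the reported results. The function name, its arguments, and the seed handling are hypothetical, and the noise vectors are assumed to be standard normal because the noise parameters are not recoverable from the text here.

```python
import random


def sample_gaussian_model(n, p, q, r, a, K=2, seed=0):
    """Return data[i][k], a (p + q)-vector, for object i under condition k.

    Signal: mu_i ~ N(0, I_p), then X1_ik ~ N(mu_i, r^{-1} I_p).
    Noise:  X2_ik assumed N(0, I_q) (ASSUMPTION, not taken from the source).
    Output: X_ik = [(1 - a) * X1_ik, a * X2_ik].
    """
    rng = random.Random(seed)
    sd = r ** -0.5  # standard deviation matching covariance r^{-1} I_p
    data = []
    for _ in range(n):
        mu = [rng.gauss(0.0, 1.0) for _ in range(p)]
        row = []
        for _ in range(K):
            signal = [rng.gauss(m, sd) for m in mu]
            noise = [rng.gauss(0.0, 1.0) for _ in range(q)]
            row.append([(1 - a) * s for s in signal] + [a * z for z in noise])
        data.append(row)
    return data
```

Large values of r make the K signal parts of each object nearly coincide, which is the sense in which r controls matchedness; a close to 1 lets the noise dimensions dominate.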

Figure 8 presents simulation results for this Gaussian model, entirely analogous to those
depicted in Figure 6.




Figure 8: Gaussian model simulation results plotting the Type I error level $\alpha$ against power
$\beta = P[d(y_1, y_2) > c_\alpha \mid H_A]$, indicating jofc is superior to both pom and cca. These results are
entirely analogous to those presented for the Dirichlet product model in Figure 6.

5 Experimental Results



5.1 Testing 

A collection of documents $\{x_{i1}\}_{i=1}^{n}$ is drawn from the English Wikipedia, corresponding to
the directed 2-neighborhood of the document "Algebraic Geometry." This yields $n = 1382$ and,
through Wikipedia's own 1-1 correspondence, the associated French documents $\{x_{i2}\}_{i=1}^{n}$. For
dissimilarity matrices $\Delta_k$, $k = 1, 2$, we use the Lin & Pantel discounted mutual information
[25, 26] and cosine dissimilarity $\delta_k(x_{ik}, x_{jk}) = 1 - (x_{ik} \cdot x_{jk}) / (\|x_{ik}\|_2 \|x_{jk}\|_2)$.
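
As a small illustration of the cosine dissimilarity formula, the following Python sketch computes it for plain numeric feature vectors and assembles a dissimilarity matrix. The function names are hypothetical, the feature extraction (discounted mutual information) is not shown, and zero vectors are assumed not to occur.

```python
import math


def cosine_dissimilarity(x, y):
    """1 - (x . y) / (||x||_2 ||y||_2); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def dissimilarity_matrix(docs):
    """n x n matrix of pairwise cosine dissimilarities within one condition."""
    n = len(docs)
    return [[cosine_dissimilarity(docs[i], docs[j]) for j in range(n)]
            for i in range(n)]
```

Orthogonal vectors have dissimilarity 1, and vectors that differ only by a positive scale factor have dissimilarity 0 up to rounding.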

Our results are obtained by repeatedly randomly holding out four documents - two matched
pairs - and identifying the embeddings via cca, pom, and jofc based on the remaining $n = 1380$
matched pairs. The two sets of held-out matched pairs are used as $y_1$ and $y_2$, via out-of-sample
embedding, to estimate the null distribution of the test statistic $T = d(y_1, y_2)$. This allows
us to estimate critical values for any specified Type I error level. Then the two sets of
held-out unmatched pairs are used as $y_1$ and $y_2$, via out-of-sample embedding, to estimate power.
Target dimensionality $m$ is determined by the Zhu and Ghodsi automatic dimensionality selection
method [27], resulting in $m = 6$ for this data set.
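
The estimation of critical values and power from held-out pairs can be sketched as follows. This is a minimal illustration under stated assumptions: the inputs are plain lists of test statistic values $T$ already computed for matched (null) and unmatched (alternative) held-out pairs, the function names are hypothetical, and the conservative empirical quantile used here is one reasonable choice rather than necessarily the rule used for the reported results.

```python
def critical_value(null_stats, alpha):
    """Smallest observed c with at most a fraction alpha of null values above it.

    Intended for 0 < alpha < 1.
    """
    s = sorted(null_stats)
    allowed_above = int(alpha * len(s))  # floor, so the test is conservative
    k = len(s) - allowed_above - 1
    return s[max(k, 0)]


def estimated_power(alt_stats, c):
    """Fraction of alternative (unmatched) statistics exceeding c."""
    return sum(t > c for t in alt_stats) / len(alt_stats)
```

For example, with eight null values 1, ..., 8 and alpha = 0.25 the critical value is 6, since exactly two of the eight null values exceed it.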

Figure 9 plots the allowable Type I error level against power. These experimental results
indicate that jofc is superior to both pom and cca, and are entirely analogous to the simulation
results presented above.



5.2 Ranking 

Here we consider a ranking task in which matched training data exists in disparate spaces $\Xi_1$
and $\Xi_2$, but test observation $y_2$ will be observed in space $\Xi_2$. The task is to find the match for
$y_2$ amongst a candidate collection $C = \{y_{11}, \ldots, y_{z1}\} \subset \Xi_1$ of $z > 1$ possibilities. Using the
training set of matched observations, we identify the embeddings via cca, pom, and jofc, and
out-of-sample embedding then yields $\tilde{y}_2$ and $\tilde{C} = \{\tilde{y}_{11}, \ldots, \tilde{y}_{z1}\}$. The rank $r^*$ of the one true
match to $y_2$ amongst the candidate collection $C$ in terms of $\{d(\tilde{y}_{\ell 1}, \tilde{y}_2)\}_{\ell=1}^{z}$ is our measure of
performance; $r^* = 1$ represents perfect performance, $r^* = z/2$ represents chance, and $r^* = z$ is
the worst possible.
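
The rank statistic $r^*$ can be computed directly once the candidates and the test observation have been embedded. The sketch below assumes Euclidean distance in the common space and breaks ties optimistically; both choices, and the function name, are assumptions made for illustration.

```python
import math


def rank_of_true_match(embedded_candidates, embedded_query, true_index):
    """Rank (1 = best) of the true match among the embedded candidates.

    Distances are Euclidean (ASSUMPTION); ties are resolved in favor of the
    true match, so the returned value is one plus the number of candidates
    strictly closer to the query than the true match.
    """
    dists = [math.dist(c, embedded_query) for c in embedded_candidates]
    d_true = dists[true_index]
    return 1 + sum(d < d_true for d in dists)
```

With z candidates the returned value lies between 1 and z, matching the range described above.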

For this experiment we consider a different collection of Wikipedia documents: all En- 
glish/Persian (Farsi) matched pairs (matched, again, through Wikipedia's own 1-1 correspon- 
dence) for which both documents in the pair contain at least 500 total words and at least 100 
distinct words. There are 2448 such pairs. (The word-count restrictions are to ensure that the 
documents are legitimate articles, rather than "stubs" - place-holders for future articles on the 
topic.) 



Figures 10 and 11 present notched boxplot experimental results wherein we repeatedly hold
out $z = 1000$ matched pairs from the training set. (Recall that non-overlapping notches imply
a statistically significant difference of medians.) Figure 10 depicts $r^*$ as a function of target
dimension $m$ for jofc (gray) and pom (white). Performance improves for both methods as $m$
increases from 5 to 25, with jofc superior. Performance levels off after $m = 30$ (and degrades
significantly for $m > 50$). Figure 11 depicts difference in ranks, $r^*_{pom} - r^*_{jofc}$; differences greater
than 0 indicate jofc superiority.

Figure 9: Experimental results on English/French Wikipedia documents plotting the Type I error
level $\alpha$ against power $\beta = P[d(y_1, y_2) > c_\alpha \mid H_A]$, indicating jofc is superior to both pom and cca. See
text for description.



6 Discussion and Conclusions 

We have presented a complete methodological core for manifold matching via joint optimiza- 
tion of fidelity and commensurability and comprehensive comparisons with either version of 
separate optimization. Continuing research includes comparison with other standard compet- 
ing methodologies, variations and generalizations of our omnibus embedding methodology and 
further theoretical developments. 

Here we discuss a few of the most pressing issues. 

K > 2 Conditions 

It is straightforward to generalize the omnibus dissimilarity matrix M to the case of K > 2 
conditions. 

Pre-Scaling the $\Delta_k$

The scale of the various dissimilarities has been assumed to be consistent. For Dirichlet data, this
assumption is warranted; however, pre-scaling of the $\Delta_k$ prior to constructing M is imperative
for the general case.

Figure 10: Comparative rank experimental results depicting the rank $r^*$ of the one true match to
test observation $y_2$ amongst the candidate collection $C$ in terms of $\{d(\tilde{y}_{\ell 1}, \tilde{y}_2)\}_{\ell=1}^{z}$ as a function of
target dimension $m$. For each $m \in \{5, 10, 15, \ldots, 50\}$, there are two boxplots. These results indicate
that jofc (gray) is superior to pom (white) on this data set. With $z = 1000$, both methods perform
much better than chance ($r^* = z/2$), although performance does not achieve perfection ($r^* = 1$). See
text for description.

MDS Objective

Our omnibus embedding methodology can be employed with MDS criteria other than raw stress;
the $\ell_2$ criterion provides direct correspondence to fidelity and commensurability. Weighted $\ell_2$ is
straightforward. Other MDS minimization objectives have been studied in depth, and should in
particular circumstances provide superior performance.

Imputation of W 

It seems reasonable under $H_0$ to set the diagonal elements $\delta_{k_1 k_2}(x_{ik_1}, x_{ik_2})$ of W to zero. Recall,
however, that this is not necessarily "truth;" the Dirichlet setting of Section 1.3 with $r < \infty$ would
have non-zero elements for diag(W). Still, this shrinkage of diag(W) to zero seems reasonable.

Figure 11: Comparative rank experimental results depicting difference in ranks $r^*_{pom} - r^*_{jofc}$;
differences greater than 0 indicate jofc superiority. See text for description.



However, there may be cases for which imputing non-zero values would be appropriate; for 
example, if information is available suggesting that some matchings are unreliable, then it might 
be advantageous to use larger values for these matchings. 

As for the off-diagonal elements of W, we have argued that either leaving them as missing data
unused in the subsequent optimization or letting $W = (\Delta_1 + \Delta_2)/2$ are reasonable suggestions.
We believe that more elaborate imputation should provide superior performance. In particular, it
seems clear that choosing $\lambda \in [0, 1]$ and setting $W = \lambda\Delta_1 + (1-\lambda)\Delta_2$ or $W = (\lambda\Delta_1^2 + (1-\lambda)\Delta_2^2)^{1/2}$
will be preferable in certain circumstances.
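
The two imputation rules for W can be written compactly. The sketch below is illustrative only: it takes the within-condition dissimilarity matrices as nested lists, applies the squared variant elementwise (an interpretation of the formula, stated here as an assumption), and sets the diagonal to zero as discussed above. The function name and arguments are hypothetical.

```python
def impute_W(delta1, delta2, lam=0.5, squared=False):
    """Impute between-condition dissimilarities W from delta1 and delta2.

    squared=False: W = lam * delta1 + (1 - lam) * delta2
    squared=True:  W = (lam * delta1^2 + (1 - lam) * delta2^2)^(1/2), elementwise
    The diagonal is set to zero (matched pairs treated as coincident under H0).
    """
    n = len(delta1)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # leave diag(W) at zero
            a, b = delta1[i][j], delta2[i][j]
            if squared:
                W[i][j] = (lam * a * a + (1 - lam) * b * b) ** 0.5
            else:
                W[i][j] = lam * a + (1 - lam) * b
    return W
```

Setting lam = 0.5 with squared=False recovers the simple average of the two dissimilarity matrices off the diagonal.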

Model Selection: The Choice of Target Dimensionality m 

We have assumed throughout that $X = \mathbb{R}^m$ for some pre-specified target dimension $m$. First,
we note that, in general, embedding into target spaces other than Euclidean is possible and
sometimes productive. More pressing is the necessity, in many applications, for data-driven
choice of target dimension. This is in general a vexing model selection task - the bias-variance
trade-off. Of course, $m = 1$ generally induces significant model bias and $m = n - 1$ generally
admits excessive estimation variance, as characterized in [15] Figure 12.1. Many dimensionality
selection methods based on the principle of diminishing returns in terms of variance explained
are available - in Section 5.1 we made use of the method proposed in [27], and in Section 5.2 we
presented results as a function of $m$. A dimensionality selection methodology specifically designed
for use with our omnibus embedding methodology is of significant interest.
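
To make the "diminishing returns" idea concrete, here is a minimal sketch of a profile-likelihood elbow rule of the kind commonly attributed to Zhu and Ghodsi. It is an illustration under stated assumptions, not the implementation used in Section 5.1: the input is assumed to be a list of scree values (for example, eigenvalues), the two groups are modeled as Gaussian with separate means and a pooled variance, and the function name is hypothetical.

```python
import math


def profile_likelihood_dimension(scree):
    """Return the split point q that maximizes a two-group Gaussian likelihood.

    The q largest values form one group and the rest the other; each group has
    its own mean and both share a pooled variance estimate.
    """
    d = sorted(scree, reverse=True)
    p = len(d)
    if p < 3:
        raise ValueError("need at least three scree values")
    best_q, best_ll = None, -math.inf
    for q in range(1, p):
        g1, g2 = d[:q], d[q:]
        m1 = sum(g1) / len(g1)
        m2 = sum(g2) / len(g2)
        ss = sum((x - m1) ** 2 for x in g1) + sum((x - m2) ** 2 for x in g2)
        var = ss / (p - 2)  # pooled variance estimate
        if var <= 0:
            continue  # degenerate split; likelihood is unbounded, skip it
        ll = -0.5 * p * math.log(2 * math.pi * var) - ss / (2 * var)
        if ll > best_ll:
            best_q, best_ll = q, ll
    return best_q
```

On a scree sequence with a clear gap after the third value, such as 10, 9.5, 9, 1, 0.8, 0.6, 0.5, 0.4, the rule selects q = 3.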

One illustrative point in this regard is that the general commensurate-space approach
considered throughout this article - for all three approaches jofc, pom, and cca - adds a further
complication with respect to identification of optimal target dimension: the optimal target
dimensions $m_k^*$ for the various $\Delta_k$ will not be the same. This adds to the degree of difficulty in
designing methods for identifying the optimal common-space target dimension $m^*$.



Learning the $\pi_k$

We have assumed that the maps $\pi_k$ from object space $\Xi$ to the conditional spaces are fixed
(see Figure 1). Indeed, $\Xi$ and the $\pi_k$ have been treated as notional only. In some circumstances,
it may be possible to use performance analyses to glean information concerning the induced
conditional distributions and profitably adjust the $\pi_k$, in a manner analogous to fusion frames.



Fast Omnibus Embedding 

Out-of-sample embedding of test data obviates re-learning the mappings for each inference.
More importantly, it is straightforward to make a version of our omnibus embedding methodology 
fast (O(n)). Making an effective fast version requires numerous methodological choices for various 
stages of jofc. 



Commensurability Error vs Hausdorff Distance on $G_{p,m}$

In the simple setting of Euclidean spaces the pom methodology yields two elements of the
Grassmann space $G_{p,m}$ of $m$-dimensional subspaces of $\mathbb{R}^p$. This space is a manifold under
the Hausdorff distance $2\sin(\theta/2)$, where $\theta$ is the canonical angle between subspaces [29]. Under
special conditions the Hausdorff distance between pom's two subspaces and the commensurability
error between their respective embeddings are closely related.



See Figure 12 for a first example, from the Dirichlet product model simulation presented
in Figure 6. Each point in Figure 12 represents a Monte Carlo replicate. We note that the
Hausdorff distance between pom's two subspaces and the commensurability error between their
respective embeddings are strongly correlated. Furthermore, the red points represent replicates
for which the conditional power $P[d(y_1, y_2) > c_\alpha \mid H_A, \gamma_1, \ldots, \gamma_n]$ is low - predominantly those
replicates for which Hausdorff distance and commensurability error are large. This demonstrates
the effect of the incommensurability phenomenon on pom. The jofc embeddings are not subject
to this deleterious phenomenon.

Additional investigations concerning the superiority of jofc to pom due to the incommensu- 
rability phenomenon involve this relationship between Hausdorff distance and commensurability 
error. Significantly more involved investigations are required when, as is the case for proper 
text document analysis, one uses a more appropriate dissimilarity (Hellinger distance, or more
generally $\alpha$-divergence) on the simplex.



Three-Way MDS

Three-way MDS addresses a problem superficially similar to joint optimization of fidelity and
commensurability, in which a single configuration and two transformation
matrices are identified from two dissimilarity matrices $\Delta_1$, $\Delta_2$. It may be of interest to compare
and contrast our omnibus embedding methodology with various instantiations of three-way MDS
- particularly the identity model presented in [30].

Figure 12: Commensurability error and Hausdorff distance on the Grassmannian manifold for our
Dirichlet product model simulation (Figure 6). Strong correlation is evident. Furthermore, the red
points represent replicates for which the conditional power $P[d(y_1, y_2) > c_\alpha \mid H_A, \gamma_1, \ldots, \gamma_n]$ is low -
predominantly those replicates for which Hausdorff distance and commensurability error are large.

6.1 Conclusions

In conclusion, we have presented an omnibus embedding methodology for joint optimization 
of fidelity and commensurability that allows us to address the manifold matching problem by 
jointly identifying embeddings of multiple spaces into a common space. Such a joint embedding 
facilitates statistical inference in a wide array of disparate information fusion applications. We 
have investigated this methodology in the context of simple statistical inference tasks, and com- 
pared and contrasted with competing fidelity-only and commensurability-only methodologies, 
demonstrating the superiority of our joint optimization. 

We have focused on a simple setting and simple choices for various methodological options. 
Many variations and generalizations are possible, but the presentation here provides the core 
methodological instantiation. 